Method, system, and apparatus for adjacent-symbol error correction and detection code

ABSTRACT

A circuit and method for generating an Error Correcting Code (ECC) based on an adjacent symbol codeword that is formed in two clock phases.

BACKGROUND

1. Field

The present disclosure pertains to the field of memory and computer memory systems and more specifically to error detection and correction for memory errors.

2. Description of Related Art

Error correcting codes (ECC) have been routinely used for fault tolerance in computer memory subsystems. The most commonly used codes are the single error correcting (SEC) and double error detecting (DED) codes capable of correcting all single errors and detecting all double errors in a code word.

As the trend of chip manufacturing is toward a larger chip capacity, more and more memory subsystems will be configured in b-bits-per-chip. The most appropriate symbol ECCs to use on such memory are the single symbol error correcting (SbEC) and double symbol error detecting (DbED) codes, wherein “b” is the width (number of bits in output) of the memory device. These codes correct all single symbol errors and detect all double symbol errors in a code word. A memory designed with a SbEC-DbED code can continue to function when a memory chip fails, regardless of its failure mode. When two failing chips later line up in the same ECC word, the SbEC-DbED code provides the necessary error detection and protects the data integrity of the memory.

Existing and imminent memory systems utilize eighteen memory devices. However, the present SbEC-DbED error correcting codes utilize 36 memory devices in order to provide chipfail correction and detection. Thus, the cost increases due to the added expense of 36 memory devices for error correcting purposes, and the codes are inflexible because they do not scale (adapt) to memory systems with eighteen memory devices. Furthermore, the various circuits for encoding and decoding the errors are complex. This increases the cost and design complexity of computer systems that must ensure data integrity.

BRIEF DESCRIPTION OF THE FIGURES

The present invention is illustrated by way of example and not limitation in the Figures of the accompanying drawings.

FIG. 1 illustrates a block diagram of a code word utilized in an embodiment.

FIG. 2 illustrates an apparatus utilized in an embodiment.

FIG. 3 illustrates a flowchart of a method utilized in an embodiment.

FIG. 4 illustrates an apparatus utilized in an embodiment described in connection with FIG. 2.

FIG. 5 illustrates a system utilized in an embodiment.

DETAILED DESCRIPTION

The following description provides a method, apparatus, and system for error detection and correction of memory devices. In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate logic circuits without undue experimentation.

As previously described, a typical ECC code utilizes 36 memory devices for chipfail detection and correction, which increases the cost and design complexity of a computer system. Also, with the advent of systems with eighteen memory devices, the present ECC codes do not scale. In contrast, the claimed subject matter facilitates a new ECC code, the “adjacent-symbol” code, that supports memory systems with 18 memory devices. For example, in one embodiment, the claimed subject matter facilitates decoding and correcting memory errors in systems that utilize 18 memory devices for a memory transaction (memory rank). Furthermore, the claimed subject matter facilitates forming a code word of data with only two clock phases. Also, the adjacent-symbol ECC code corrects any error pattern within the data from one memory device and detects various errors (double device errors) from failures in 2 memory devices.

In one embodiment, the adjacent-symbol ECC code is utilized for a memory system with two channels of Double Data Rate (DDR) memory, wherein each channel is 64 bits wide with eight optional bits for ECC. Also, the memory system may utilize x4 or x8 wide memory devices (x4 and x8 refer to the number of bits that can be output from the memory device). Thus, the claimed subject matter supports various configurations of memory systems. For example, a memory system with x8 devices would utilize 18 memory devices per memory rank if ECC is supported, or 16 memory devices per memory rank if ECC is not supported. Alternatively, a memory system with x4 devices would utilize 36 memory devices per memory rank if ECC is supported, or 32 memory devices per memory rank if ECC is not supported.
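The device counts above follow from dividing the total rank width by the device width. The following is a minimal sketch of that arithmetic, assuming the two-channel, 64-data-bit, 8-ECC-bit configuration described above; the function name and parameters are hypothetical and chosen only for illustration.

```python
def devices_per_rank(device_width, ecc=True, channels=2, data_bits=64, ecc_bits=8):
    """Return the number of memory devices needed to fill one memory rank.

    device_width is 4 for x4 devices and 8 for x8 devices.
    All default values mirror the embodiment described in the text.
    """
    bits_per_channel = data_bits + (ecc_bits if ecc else 0)
    total_bits = channels * bits_per_channel
    if total_bits % device_width:
        raise ValueError("rank width is not a multiple of the device width")
    return total_bits // device_width
```

With these assumptions, x8 devices give 18 devices per rank with ECC and 16 without, and x4 devices give 36 with ECC and 32 without.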

FIG. 1 illustrates a block diagram of a code word utilized in an embodiment. The block diagram 100 comprises an adjacent symbol codeword 106 to be formed from two clock phases of data 102 and 104 from a memory device. For example, in one embodiment, a memory access transaction comprises a transfer of 128 data bits plus an additional 16 ECC check bits per clock edge, for a total of 144 bits for each clock edge (288 bits for both clock edges). In a first clock phase 102, a first nibble “n0” and a second nibble “n2” of data from a memory are transferred and mapped to a first nibble of each of two symbols of the codeword 106. Subsequently, during a second clock phase 104, a first nibble “n1” and a second nibble “n3” from a memory are transferred and mapped to a second nibble of each of two symbols of the codeword 106. The two symbols of the codeword 106 are therefore adjacent and lie on a 16 bit boundary of the code word. They are designated as “adjacent symbols”, and the codeword 106 is accordingly an adjacent symbol codeword.
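The two-phase nibble mapping can be sketched as follows. This is an illustrative model only: the function name is hypothetical, and placing the first nibble in the low four bits of each 8-bit symbol is an assumption, since the text does not specify bit ordering within a symbol.

```python
def form_adjacent_symbols(phase1, phase2):
    """Build two adjacent 8-bit symbols from two clock phases of nibbles.

    phase1 = (n0, n2): nibbles transferred in the first clock phase.
    phase2 = (n1, n3): nibbles transferred in the second clock phase.
    Assumption: the first nibble of a symbol occupies its low four bits.
    """
    n0, n2 = phase1
    n1, n3 = phase2
    for nibble in (n0, n1, n2, n3):
        if not 0 <= nibble <= 0xF:
            raise ValueError("each nibble must fit in four bits")
    symbol0 = n0 | (n1 << 4)  # first symbol: n0 from phase 1, n1 from phase 2
    symbol1 = n2 | (n3 << 4)  # second symbol: n2 from phase 1, n3 from phase 2
    return symbol0, symbol1
```

For example, `form_adjacent_symbols((0x1, 0x3), (0x2, 0x4))` returns `(0x21, 0x43)` under the stated bit-ordering assumption; together the two symbols occupy one 16-bit span of the code word.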

The scheme illustrated in the block diagram facilitates error detection and improves fault coverage of common mode errors. For example, for an x4 memory device, there is a one to one mapping of nibbles from the x4 memory device to a symbol in the underlying code word. In contrast, for an x8 memory device, there is a one to one mapping of nibbles from half of the x8 memory device to a symbol in the underlying code word. Thus, the claimed subject matter facilitates isolating common mode errors across nibbles to the symbol level and results in increased fault coverage. Therefore, for the x8 memory device, the claimed subject matter precludes aliasing for a second device failure. Likewise, device errors in the x4 memory devices are isolated to a single symbol in the codeword 106; thus, there is complete double device coverage for the x4 memory devices.

To further illustrate, there are typically two classes of double device failures, simultaneous and sequential, that occur in the same memory rank.

A simultaneous double device failure has no early warning sign because there is no indication of an error in a previous memory transaction. Typically, the computer system reports an uncorrectable error in the absence of aliasing. However, the system might incorrectly report a correctable single device failure. In that case, the aliasing may be discovered in subsequent accesses because the error pattern might change so as to preclude the alias.

In contrast, a sequential double device failure is a more typical failure pattern than a simultaneous double device failure. Typically, the first device error is detected as a correctable error. For a second device failure, there may be two outcomes in one embodiment: the error is reported as uncorrectable, or the error is reported as a correctable error at a new location. In the event of an uncorrectable error for the second device failure, the analysis is complete. Otherwise, the system changes the error location from the first device failure to the second device's failure location. Therefore, the preceding method for detecting the alias is accurate because it is unlikely that the first device failure location resolves itself, and even less likely that it does so at the same instant that the second device fails.
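The sequential case can be modeled as tracking the last reported correctable location and flagging a change of location. The sketch below is one possible interpretation of the paragraph above, not the disclosed implementation; the class name, status strings, and return values are all hypothetical.

```python
class RankErrorTracker:
    """Track correctable-error locations within one memory rank (illustrative)."""

    def __init__(self):
        # Location of the device failure currently on record, if any.
        self.failed_location = None

    def observe(self, status, location=None):
        """Record one decode result and return a description of it."""
        if status == "uncorrectable":
            return "uncorrectable error"
        if status != "correctable":
            return "no error"
        if self.failed_location is None:
            self.failed_location = location
            return "first device failure"
        if location != self.failed_location:
            # A first failure is unlikely to resolve itself, so a correctable
            # error at a new location suggests a second failure that aliased.
            self.failed_location = location
            return "suspected second device failure (possible alias)"
        return "known device failure"
```

A correctable error that repeatedly reports the same location is treated as the known failure, while a move to a new location is the signal the text relies on to expose an alias.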

A few examples of double device errors that are always detected (no aliasing) are double bit errors, double wire faults, wire faults in one memory device with a single bit error in a second memory device, and a fault that affects only one nibble of each memory device.

In one example of a device error for the x8 memory device, all 16 bits of the codeword (adjacent symbols) may be affected (corrupted) because the failure results in an error for both nibbles and both clock phases of the memory device's data. Thus, the claimed subject matter facilitates the correction of this device failure by first correcting the 16 bits that are in error. However, in the event of a second memory device failure, the code detects the error pattern in two groups of 16 bits which are aligned on 16-bit boundaries in the code word 106.

FIG. 2 illustrates an apparatus utilized in an embodiment. From a high-level perspective, the apparatus generates a code word by creating check bits to be appended to data that is forwarded to memory. Subsequently, the apparatus generates a syndrome based at least in part on decoding the code word received from memory and facilitates classifying errors and correcting the errors. In one embodiment, the code word from the memory device is an adjacent symbol codeword that was described in connection with FIG. 1.

The apparatus comprises an encoder circuit 202, at least one memory device 204, a decoder circuit 206, an error classification circuit 208, and a correction circuit 210.

The encoder circuit receives data that is to be forwarded to the memory device or memory devices 204. The encoder circuit generates a plurality of check bits based at least in part on the data. Thus, a codeword is formed based at least in part on the plurality of check bits and the data and is forwarded to the memory device or memory devices 204.

In one embodiment, the check bits are generated from the binary form of a G-matrix, wherein the matrix has 32 rows and 256 columns to form 32 check bits. The check bits are computed as follows:

$c_i = \sum_{j=0}^{255} d_j \times G_{ij}$ for $i = 0$ to $31$

For binary data, the multiply operation becomes an AND function and the sum operation becomes the 1-bit sum, or XOR, operation. Thus, the resulting encoding circuit comprises 32 XOR trees, each tree computing one of the 32 check bits.
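The AND/XOR reduction can be illustrated in software as follows. This is a minimal sketch, not the disclosed circuit: the function name is hypothetical, and the small 3-row by 4-column matrix is a made-up stand-in, since the actual 32 by 256 G-matrix is not given in this description.

```python
def generate_check_bits(data_bits, g_matrix):
    """Compute c_i as the XOR over j of (d_j AND G[i][j]), one bit per row."""
    check_bits = []
    for row in g_matrix:
        if len(row) != len(data_bits):
            raise ValueError("each G-matrix row must match the data length")
        bit = 0
        for d, g in zip(data_bits, row):
            bit ^= d & g  # multiply becomes AND, sum becomes XOR
        check_bits.append(bit)
    return check_bits


# Hypothetical toy matrix used only to exercise the function.
G_TOY = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]
```

With the toy matrix, `generate_check_bits([1, 0, 1, 1], G_TOY)` returns `[0, 1, 0]`. Each loop over a row plays the role of one XOR tree; in hardware, only the data bits whose G-matrix entry is 1 would feed that tree.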

Subsequently, the memory device or memory devices 204 return data and the check bits back to the decoder circuit 206. In one embodiment, the decoder circuit generates a 32-bit syndrome based at least in part on a 288-bit code word (as earlier described in connection with FIG. 1 for the 288-bit code word).

In one embodiment, the syndrome is generated from an H-matrix, wherein the matrix comprises 32 rows and 288 columns. Each syndrome bit is calculated as follows:

$s_i = \sum_{j=0}^{287} v_j \times H_{ij}$ for $i = 0$ to $31$

As previously described with the encoder circuit, the generation of the syndrome bits is simplified to an XOR operation over the code word bits corresponding to the columns of the H-matrix that have a binary 1 value. Thus, the decoding circuit comprises 32 XOR trees, each tree computing one of the 32 syndrome bits. Therefore, in one embodiment, a 32 bit syndrome is generated by applying an H-matrix to a 288 bit codeword. However, the claimed subject matter is not limited to this bit configuration. One skilled in the art appreciates modifications to the size of the syndrome and codeword.
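A scaled-down model of syndrome generation is sketched below. It assumes a systematic construction in which the H-matrix is the G-matrix with an identity block appended, so that a clean codeword yields a zero syndrome; this relationship is a common convention and an assumption here, not something stated in this description. The toy matrices and all names are hypothetical.

```python
def gf2_matvec(matrix, vector):
    """Multiply a binary matrix by a binary vector over GF(2)."""
    return [sum(m & x for m, x in zip(row, vector)) % 2 for row in matrix]


# Hypothetical 3x4 stand-in for the 32x256 G-matrix.
G_TOY = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]
# Assumed systematic form: H = [G | I], a 3x7 stand-in for the 32x288 H-matrix.
H_TOY = [row + [int(i == k) for k in range(3)] for i, row in enumerate(G_TOY)]

data = [1, 0, 1, 1]
codeword = data + gf2_matvec(G_TOY, data)  # data followed by its check bits

clean_syndrome = gf2_matvec(H_TOY, codeword)

corrupted = list(codeword)
corrupted[1] ^= 1  # flip one data bit to model a memory error
error_syndrome = gf2_matvec(H_TOY, corrupted)
```

Under these assumptions the clean codeword produces an all-zero syndrome, and the single flipped bit produces a syndrome equal to the corresponding column of the H-matrix, which is what lets a decoder locate the error.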

The error classification and error correction are described in connection with FIG. 4.

FIG. 3 depicts a flowchart for a method utilized in an embodiment. The flowchart depicts a method for detecting whether there were errors in data in a transaction with a memory device or devices. A first block 302 generates check bits to be appended to data for forwarding to a memory device or devices. An adjacent symbol codeword is generated based at least in part on data received from the memory device or devices to be utilized for checking the integrity of the data, as depicted by a block 304. A decoder generates a syndrome based at least in part on the adjacent symbol codeword, as depicted by a block 306. In the presence of an error as determined by the syndrome, an error classification and correction is performed, as depicted by a block 308.

FIG. 4 illustrates an apparatus utilized in an embodiment described in connection with FIG. 2. As previously described, FIG. 4 describes one embodiment of the error classification and error correction in connection with FIG. 2.

The error classification is based at least in part on the decoding circuit's computation of the syndrome. For example, in one embodiment, if the syndrome (S)==0, then there is no error. Otherwise, if the syndrome (S)>0, there is an error. Also, it is optional to further classify the error by computing an error location vector L. For example, in one embodiment, the error is uncorrectable if L==0. Otherwise, the error is correctable in an indicated column if L>0. Furthermore, one may further classify the correctable error as to whether the error occurs in a data column or a check column. For example, if the error is in a check column, the data portion of the code word may bypass the correction logic.
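The classification rules above can be summarized in a short sketch. How the error location vector L is derived from the syndrome is not described here, so L is taken as an input. The function name, the integer bit-vector representation of S and L, and the `check_column_mask` parameter are all assumptions made for illustration.

```python
def classify_error(syndrome, locator, check_column_mask=0):
    """Classify a decode result from the syndrome S and location vector L.

    syndrome and locator are non-negative integers used as bit vectors.
    check_column_mask (hypothetical) marks which locator bits correspond
    to check columns rather than data columns.
    """
    if syndrome == 0:
        return "no error"
    if locator == 0:
        return "uncorrectable"
    if locator & check_column_mask:
        # The data portion may bypass the correction logic in this case.
        return "correctable (check column)"
    return "correctable (data column)"
```

For instance, a non-zero syndrome with `locator == 0` is reported as uncorrectable, while a non-zero locator that overlaps the check-column mask indicates that the data itself needs no correction.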

In yet another embodiment, a single device correctable error may be classified based at least in part on a weight of the error value. As depicted in FIG. 4, an adjacent pair may generate error values e₀ and e₁. The error locator vector L is then used to gate the error values onto a plurality of busses, 402 and 404, because the circuits allow the error locator bits for one adjacent pair to be enabled for a given error pattern.

Thus, the claimed subject matter allows for test coverage of both single and double device errors.

FIG. 5 depicts a system in accordance with one embodiment. The system in one embodiment comprises a processor 502 that is coupled to a chipset 504 that is coupled to a memory 506. For example, the chipset performs and facilitates various operations, such as memory transactions between the processor and memory, and verifies the data integrity by utilizing the adjacent symbol codeword as described in connection with FIG. 1. In one embodiment, the chipset is a server chipset to support a computer server system. In contrast, in another embodiment, the chipset is a desktop chipset to support a computer desktop system. In both of these embodiments, the system comprises the embodiments depicted in FIGS. 1-4 of the specification to support the adjacent symbol codeword and the error correction and detection methods and apparatus.

While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art upon studying this disclosure.

1. A method for forming an adjacent symbol codeword comprising: generating a set of m bits, wherein m is an integer, of a first symbol and a set of m bits of a second symbol from a first set of data during a first clock phase; and generating a set of n bits, wherein n is an integer, of the first symbol and a set of n bits of the second symbol from a second set of data during a second clock phase.
 2. The method of claim 1 wherein the first and second set of data is from a memory.
 3. The method of claim 2 wherein the memory is a Double Data Rate (DDR) memory.
 4. The method of claim 1 wherein both m and n are four bits and constitute a nibble.
 5. The method of claim 1 wherein the adjacent symbol codeword comprises an adjacent formation of the first and second symbol.
 6. The method of claim 1 further comprising isolating a common mode error across the m and n bits of the first and second symbol.
 7. A method for testing a memory comprising: generating a plurality of check bits to append to data that is forwarded to the memory; generating an adjacent symbol codeword based at least in part on data received from the memory; decoding the adjacent symbol codeword; and determining whether an error exists in the memory.
 8. The method of claim 7 wherein decoding the adjacent symbol codeword comprises generating a syndrome based at least in part on the adjacent symbol codeword.
 9. The method of claim 7 wherein determining whether an error exists in the memory is based at least in part on the syndrome.
 10. The method of claim 9 wherein an error exists based on the syndrome, further comprising: classifying the error in the received data; and correcting the error in the received data.
 11. The method of claim 7 wherein the memory is a Double Data Rate (DDR) memory.
 12. The method of claim wherein the syndrome is thirty two bits based on a two hundred eighty eight bit codeword.
 13. An apparatus for an Error Correcting Code comprising: a first logic to generate a plurality of check bits based on a set of data, to append the check bits to the set of data that is to be forwarded to a memory; a second logic to receive a codeword from the memory and to generate a syndrome based on the codeword, and to detect whether an error exists based on the syndrome; a third logic to classify the error if it exists; and a fourth logic to correct the error if it exists.
 14. The apparatus of claim 13 is incorporated within a server chipset.
 15. The apparatus of claim 13 wherein the memory is a Double Data Rate (DDR) memory.
 16. The apparatus of claim 13 wherein the syndrome is thirty two bits and the codeword is 288 bits.
 17. The apparatus of claim 13 wherein the first logic is an encoder and utilizes the formula: $c_i = \sum_{j=0}^{255} d_j \times G_{ij}$ for $i = 0$ to $31$, to generate the plurality of check bits.
 18. The apparatus of claim 13 wherein the second logic is a decoder and the syndrome is an H matrix that is generated by the formula: $s_i = \sum_{j=0}^{287} v_j \times H_{ij}$ for $i = 0$ to $31$, to generate the syndrome.
 19. An apparatus to classify an error from a memory comprising: a first logic to generate an H matrix syndrome; and to determine whether an error exists based on the syndrome, if so, to classify an error type of the error.
 20. The apparatus of claim 19, wherein the H matrix syndrome is generated by the formula: $s_i = \sum_{j=0}^{287} v_j \times H_{ij}$ for $i = 0$ to $31$.
 21. The apparatus of claim 19 wherein to classify the error type comprises the first logic to generate an error location vector.
 22. The apparatus of claim 21 wherein the error location vector determines whether the error is correctable: a value of zero in the error location vector indicates the error is uncorrectable; in contrast, a value that is greater than zero in an indicated column of the error location vector indicates the error is correctable.
 23. The apparatus of claim 19 wherein the error type may be either a single device error or a double device error.
 24. The apparatus of claim 23 wherein the double device error is either of a simultaneous error or a sequential error.
 25. The apparatus of claim 23 wherein the single device error is classified based on a weight of an error value, e₀ and e₁, and the error location vector is to gate the error values.
 26. A system comprising: a processor, coupled to a memory and a chipset, to generate an operation to the memory via the chipset; and the chipset to utilize an Error Correcting Code (ECC) based on an adjacent symbol codeword that is formed in two clock phases and to determine whether an error exists in a plurality of data received by the chipset from the memory, if so, to classify a type of the error based on an H matrix syndrome.
 27. The system of claim 26 wherein the memory is a Double Data Rate (DDR) memory.
 28. The system of claim 26 wherein the H matrix syndrome is generated by the formula: $s_i = \sum_{j=0}^{287} v_j \times H_{ij}$ for $i = 0$ to $31$.
 29. The system of claim 26 wherein the system is a server.